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KRONECKER SUBSTITUTION 



DAVID HARVEY 

Abstract. We give several new algorithms for dense polynomial multiplica- 
tion based on the Kronecker substitution method. For moderately sized input 
polynomials, the new algorithms improve on the performance of the standard 
Kronecker substitution by a sizeable constant, both in theory and in empirical 
tests. 



1. Introduction 

The Kronecker substitution method is an algorithm for computing the product 
of two polynomials. The basic idea was originally suggested by Kronecker [5] as 
a means of reducing problems concerning multivariate polynomials to univariate 
polynomials; later Schonhage [4 suggested a similar idea to reduce multiplication 
in Z[x] to multiplication in Z. The technique is widely used in practice: for exam- 
ple, the Magma computer algebra system uses Kronecker substitution to multiply 
polynomials in Z[x] in some cases [5J, and Victor Shoup's NTL library [5] uses 
Kronecker substitution to reduce multiplication in GF(p n )[a;] to multiplication in 
GF(p)[x]. 

The reduction from Z[x] to Z is best explained by an example. To compute the 
product 

h(x) = (621x 3 + 887a; 2 + 610a; + 274) x (790a; 3 + 424a; 2 + 298a; + 553), 
one evaluates the polynomials at x — 10 , and then computes the integer product 
h(l0 7 ) = 621000088700006100000274 x 790000042400002980000553 
= 490590096403410430461082839078846704189820151522. 

The coefficients of h(x) may then be read off from the digits of h(10 7 ), since the 
choice of evaluation point ensures that the coefficients do not overlap: 

h(x) = 490590a; 6 +964034a; 5 +1043046a; 4 +1082839a; 3 +788467a; 2 +418982x+151522. 

Of course, on real hardware, one evaluates at 2^ rather than 10 N , since then 
the packing and unpacking phases are accomplished efficiently (in linear time) by 
various bit shifting and masking operations. 

The main advantage of this algorithm for multiplication in Z[a;] is that it places 
the burden of computation on existing highly optimised software libraries for mul- 
tiprecision integer arithmetic, such as the GMP library This point of view is 
discussed further in [T]. However, the algorithm also has a disadvantage: it intro- 
duces unwanted zero-padding. In the example above, if one computes the integer 
product by hand using the classical algorithm, about three-quarters of the digit-by- 
digit products involve multiplying by zero, because about half of the input digits 
are zero. Clearly it is desirable to somehow skip these redundant multiplications. 
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In this paper we give several new algorithms that mitigate this inefficiency, with- 
out sacrificing the advantage mentioned above. The basic idea is to evaluate not 
just at a single point, but at several points; instead of reducing to a single multi- 
plication, we reduce to several smaller multiplications. The evaluation points are 
chosen carefully so that the packing and unpacking phases are still highly efficient. 

The benefit of this approach accrues as follows. Let M(n) denote the time 
required to multiply rt-bit integers. Suppose that the standard Kronecker scheme 
reduces a given polynomial multiplication problem in Z[x] to a multiplication of 
two 6-bit integers in Z. The cost is M(b), plus 0(b) overhead associated with 
the packing and unpacking steps. One of our new algorithms (j ]3.4| reduces this 
problem instead to four multiplications of size about 6/4, so the cost becomes 
4M(b/4)+0(b). 

The relation between M(b) + 0(b) and 4M(b/4) + 0(b) depends on the underlying 
integer multiplication algorithm, and on the implied constants in the 0(b) terms. If 
b is very large, then one has available FFT-based methods for integer multiplication, 
and M(b) behaves roughly like b log b, leaving little difference between the two 
strategies. However, if b is relatively small, then M(b) will typically behave like b a 
for some a > 1; for example, GMP version 4.2.2 implements classical multiplication 
(a = 2), Karatsuba multiplication (a ~ 1.58), and Toom-Cook 3- way multiplication 
(a » 1.46). In this situation we expect the new algorithm to win by a factor of about 
4 Q_1 . Under the classical multiplication regime, the theoretical gain is a factor of 
4 2_1 = 4, which is equivalent to 'skipping' all of the redundant multiplications by 
zero in the example given above. Finally, when b is sufficiently small, we expect the 
0(b) overhead to dominate, and the usual Kronecker substitution becomes faster, 
simply because it has a smaller constant in the 0(b) term. For the implementation 
discussed in fj4j we find that the 0(b) term already interferes in the region where 
a is somewhat less than 2, so we never quite achieve the factor of four speedup 
suggested by the above theoretical analysis. 

Another interesting feature of the new algorithms is that they are trivially par- 
allclisable, since they reduce the original problem to several independent multipli- 
cations. For example, using the algorithm described in |3.4| one can easily split a 
large multiplication problem in Z[x] into four threads, without needing to parallelise 
any of the internals of the integer multiplication routine. 

The organisation of this paper is as follows. In Sj2]we first consider the technically 
simplest case of reducing multiplication in R[x,y] to multiplication in R[x]. The 
standard Kronecker substitution evaluates at y = x N for a suitably large N . We 
give three new algorithms. The first algorithm evaluates at y = x N and y — 
x~ N , where N' is about half the size of N. The second algorithm evaluates at 
y = x N and y = —x N , but only works for rings in which the multiply-by-two 
map is injective. The third algorithm combines these two algorithms, evaluating 
at four points y — x N , x~ N , — x N and — x~ N , where N" is about N/4. In 
Sj3]we adapt these algorithms to the case of the substitution from Z[x] to Z. The 
phenomenon of carries in integer arithmetic (the archimedean property of Z) makes 
this slightly more complicated than the bivariate case. Finally in Sj3]we give some 
examples of timings of an implementation of the algorithms for the specific problem 
of multiplication in (Z/nZ)[x]. We observe that the 'four-point' variant is almost 
twice as fast as the standard Kronecker substitution, over a large range of problem 
sizes. 



MULTIPOINT KRONECKER SUBSTITUTION 



3 



2. The polynomial case 

Let R be a commutative ring with identity. We will regard a polynomial p £ R[x] 
as a vector of coefficients of a certain known length I (i.e. the coefficient of x £_1 is 
permitted to be zero). Similarly a bivariate polynomial p £ R[x, y] will be treated 
as a rectangular array of coefficients, with a certain length l x with respect to x and 
a certain length £ y with respect to y. We write such a, p a,s p — X^=o^ _1 Pi{ x )y l i 
where each pi £ R[x] has length i(pi) — £ x (p). 

Throughout this section we fix two polynomials f,g £ R[x,y\; we are interested 
in computing their product h = fg. For simplicity we assume that t x {f) = ^x(g) 
and that i y (f) = i v (g)- We denote these by L x and L y respectively, and assume 
that L x > 1 and L y > 1. We then have ^ x (/i) = 2L X — 1 and ^(/i) = 2L y — 1. It 
is not difficult to adapt all of the algorithms below to the case where / and g have 
different lengths. 

2.1. The standard Kronecker substitution. Let N = 2L X — 1. We evaluate at 
y = x N , that is, we compute 

L y -1 

X 

i=0 

as an element of R[x]. Since N > L x , this evaluation step consists simply of writing 
down the coefficients of /o, /i, ■ • • , /z,„-i, with L x — 1 zeroes between /j and 
The polynomial /(x, x w ) has length (L y — l)N + L x = 2L x L y — L x — L y + 1. We 
evaluate similarly for 5, and then multiply in R[x] to obtain h(x,x N ) = f(x,x N ) ■ 
g(x, x N ). Note that 

2L s -2 

/i(x,x JV )= Hx)x lN , 

i=0 

and ^(/ij) = 2L X — 1 = N for each i. Therefore the coefficients of /i, do not overlap 
those of h i+ i in h(x, x N ) for any i, so it is easy to read off the coefficients of h(x, y). 
We obtain: 

Proposition 1. T/ie standard Kronecker substitution reduces the problem of com- 
puting h = fg to multiplying two polynomials of length 2L x L y — L x — L y + 1 in 
R[x\. 

2.2. Reciprocal evaluation points. In our first variant, we evaluate at y — x N 
and x~ N , where TV = L x . We obtain 

l v -i 

f(x,x N )= fi(*)x iN , 
x N ^-Vf(x,x- N ) = ^ f t (x)x^-^ N = f Ly ^ t (x)x^ . 

4 = 1=0 

(The normalising factor x N ( Ly ~ 1 ' > ensures that we have an element of R[x] rather 
than of R[x, x" 1 ].) 

Since N is exactly L x , these two evaluations arc obtained by concatenating the 
coefficients /o, /1, ■ • • , fh y -\ directly, with no zero-padding in between; they differ 
only in the order of concatenation. 
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We perform similar evaluations for g, and then compute the products 

h(x ) x N )=f(x ) x N )-g(x,x N ), 

x 2N ^y-^h(x, x~ N ) = x^" 1 ) f(x, x~ N ) ■ x N ^-^g(x, x~ N ). 

These are both multiplications of polynomials in R[x] of length NL y = L x L y . 

We must now show how to recover the coefficients of h(x, y) from knowledge of 
the two polynomials 

2L„-2 

h(x,x N )= h i( x ) xiN > 

2L y -2 

a ?Wv-V h (x,X- N )= ]T h 2Ly .2-i{x)x iN . 
i=0 

Note that each hi has length 2L X — 1 = 2 N — 1 , so the coefficients of hi generally 
do overlap those of hi+i, in both sums. However, there are two exceptions: 

• The lowest N terms of h(x,x N ) are precisely the lowest N terms of ho- 

• The highest N — 1 terms of x 2N ( Ly ~^ h(x, x~ N ) are precisely the highest 
N — 1 terms of h Q . 

We therefore completely recover ho by gluing together these two halves. Then we 
subtract ho from the appropriate position in both sums, revealing the two halves 
of hi. We repeat this procedure for h±, h 2 , ■ ■ ■ , ^2i„-2 to completely determine 
h(x,y). 

We obtain the following: 

Proposition 2. The 'reciprocal' Kronecker substitution reduces the problem of com- 
puting h = fg to two multiplications of polynomials of length L x L y in R[x], plus 
0(L x L y ) subtractions in R. 

2.3. Negated evaluation points. The next algorithm only works for rings in 
which the multiply-by-two map is injective, so for example the algorithm does not 
work over a fi eld of characteristic two. 

As in ^2.2 put N — L x . Evaluate at y = x N and — x N , obtaining 

Ly 1 Ly 1 

f{x, x N ) = J2 Mx)x lN > ~x N ) = E {-i)*Mx)x iN . 

i=0 i—0 

The first value is obtained by simply concatenating the coefficients of fo, fi, ■ ■ ■ , fh y -\ 
without any intervening zero-padding. The second value is obtained in the same 
way, but with the sign alternating on each chunk. 

Perform a similar evaluation on g, and then multiply to obtain h(x, ±x N ) = 
f(x,±x N ) ■ g(x,±x N ). As in the 'reciprocal' algorithm, these are both multiplica- 
tions of polynomials of length L x L y . 

Now decompose h into even and odd parts h^ and h^ as 

h(x,y)^h^(x,y 2 ) + yh^(x,y 2 ), 
where 40 (0) ) = = 2L X - 1, l y {h^) = L y and l y (h (1) ) = L y — 1. That is, 

Ly 1 Ly 2 

h(°\x,y) = £ h 2i (x)y\ h^(x,y) = £ h 2l+ i{x)y\ 

i=0 i=0 
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Wc find that h(x,±x N ) = h^(x, x 2N ) ± x N h^(x, x 2N ), and inverting the system 
we obtain 

/i (0) (x x 2N ) = xN ) + h ( x > ~ xN ) 

hWfrr rr^N\ _ h(x,X N ) - h(x,~X N ) 
It [X,X ) — 

Since £ x (hi) = 2L X — 1 < 2N, we are able to read off the coefficients of hn from 
h,(°\x, x 2N ), and those of /i2i+i from h^(x,x 2N ). Therefore we have: 

Proposition 3. The 'negated' Kronecker substitution reduces the problem of com- 
puting h = fg to two multiplications of polynomials of length L x L y in R[x], plus 
0(L x L y ) additions/subtractions in R and 0(L x L y ) divisions by 2 in R. 

2.4. Four evaluation points. We continue to assume that doubling is injective 
in R. The final algorithm we present takes advantage of the fact that the key ideas 
of the 'reciprocal' and 'negated' variants are essentially orthogonal, and may be 
combined. 

N „N „-N n „A „-N. 



Let N = \L x /2~\. We evaluate at y = x , — x , x and —.*■ 



Ly-1 



f(x,x N )= M*)* iN , 

t=0 
L y -1 

f(x 1 -x N )= ^i-iyfiix)^, 

i=0 

x N ^-Vf(x,x- N )= ]T K-i-i(x)x iN , 

j=0 
Ly-1 

N ^-^f( X ,-X- N ) = Y {-Ifhy-l-iix)^ 



A new phenomenon arises here: the coefficients fa overlap even in the evaluation 
phase, in all four of the above sums, so some additions and subtractions in R 
are required to compute them. Each of the four polynomials above has length 
N(L y -l) + L x . 

We evaluate similarly for g, and then multiply to obtain 

h(x,±x N )=f(x,±x N )-g(x,±x N ), 
x 2N ^-^h(x, ±x- N ) = Z'^- 1 ) f(x, ±x- N ) ■ x N ^-^g{x, ±x~ N ). 



As in { 2.3 we decompose h(x,y) into odd and even parts, 



h(x,y) = h^{x,y 2 ) + yh^{x,y 2 ), 
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h(x,x N ) + h(x,-x N ) 
2 ' 
h(x,x N )-h(x,-x N ) 
2x N 

x 2N ( L v-Vh(x, x- N ) + x 2N ^-^h(x, -x- N ) 
2 ' 
x 2N ^ L y-^h(x, x- N ) - x 2N ( L v-Vh(x, -x- N ) 
2x N 

(Note that i y {h^) = L y — 1, whence the normalising factor x 2N ( Ly ~ 2 K) 

Now consider h^(x, x 2N ) and x 2N(L y-^h^(x,x- 2N ). Since i{hi) = 2L X - 
1 < 47V, we may use the same strategy as in §2.2| first reading off the low 2N 
coefficients of ho from h^°'(x,x 2N ) and the high 2N — 1 coefficients of ho from 
x 2N ^ L "~^ h( ' (x, x~ 2N ). Iterating, we obtain h2, hi, . . . , /i2L„-2- Applying the same 
procedure to h^\x, x 2N ) and x 2N ^ L "~ 2 ^h^ (x, x~ 2N ), we obtain hi,ha, . . . , h 2 L y -3- 
Finally we have: 

Proposition 4. The 'four-point' Kronecker substitution reduces the problem of 
computing h = fg to four multiplications of polynomials of length \L x /2~\ [L y — 
1) + L x in R[x], plus 0(L x L y ) additions / subtractions in R and 0(L x L y ) divisions 
by 2 in R. 

2.5. Complex evaluation points. We mention here another variant, which does 
not at present appear to be competitive with the algorithms above, but may suggest 
further avenues for research. 

Put M — \L X /2~\ 1 and let S — R[i], where i is a primitive fourth root of unity. 
We evaluate at y = x M , —x M and ix M . The value f(x,ix M ) S S[x] has 'real' 
and 'imaginary' parts, both lying in R[x], so we have four polynomials in R[x] 
altogether. We compute the pointwise products, 

h(x,x M ) = f(x,x M )-g(x,x M ), 

h(x,-x M )=f(x,-x M )-g(x,-x M ) i 

h(x,ix M ) = f(x,ix M ) ■ g(x,ix M ). 

The first two multiplications are ordinary multiplications in R[x]; the last one 
is a 'complex' multiplication, and so requires three multiplications in R[x]. One 
checks that by taking appropriate linear combinations of these three products, the 
coefficients of h may be reconstructed. Unfortunately, one has committed five mul- 
tiplications instead of four, so the algorithm appears to be inferior to the algorithm 
of the previous section. 

Further variants are possible. For example, take S = R[lu], where uj is a primitive 
cube root of unity, and let M — \L x /3\. One may evaluate at y = ±x M and 
y = ztcux M , reducing the problem to eight multiplications in R[x] that are one-sixth 
the size of the multiplication generated by the standard Kronecker substitution. 

3. The integer case 

In this section we adapt the above algorithms to the case of the substitution from 
Z[x] to Z. This is mostly straightforward; the main complication is the management 
of carries in the reconstruction phase in the analogue of the 'reciprocal' algorithm. 



and we find that 

h^(x,x 2N ) = 
h^(x,x 2N ) = 
x 2N ^-Vh^{x,x- 2N ) = 
x 2N ^- 2 ^h^(x,x- 2N ) = 
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Fix two polynomials /, g G Z[x], with product h. We assume that they have the 
same length L, and put e = |~log 2 L] . We write / = ^2 i= Q fix 1 , and similarly for 
g. The length of an integer n is defined to be 1 + [log 2 MJ (t ne number of bits 
in the binary representation of \n\). We assume that / and g have non-negative 
coefficients of length at most b for some integer b > 1. It should be possible to 
handle the case of signed coefficients using essentially the same techniques, but we 
have not checked the details. 

Note that the coefficients of the product h have length at most 2b + e. In fact, 
they satisfy the slightly stronger inequality 

(1) < hi < L(2 b - l) 2 < 2 e (2 2b - 2 b+1 + 1), 

which we will need in §3.2| to control the propagation of carries. 

We assume that integers of length n may be added and subtracted in time 0{n), 
and that we may divide by a power of two in time 0(n). We also assume that 
we can 'pack' and 'unpack' binary strings in linear time. More precisely, given a 
list of integers ao, . . . , afc-i satisfying < a, < 2 C for some integer c, we require 
that we can construct the sum X^i-o 1 a%2 lc in time O(kc), and given this packed 
representation, we require that we can reconstruct the sequence of in time O(fcc). 

3.1. The standard Kronecker substitution. Let N = 2b + e, so that the coeffi- 
cients of h have length at most N. Evaluate at x — 2 N to obtain /(2 ) and g(2 N ), 
multiply to obtain h(2 N ) = f(2 N )g(2 N ), and unpack h(2 N ) to obtain the h t . 

Proposition 5. The standard Kronecker substitution reduces the problem of com- 
puting h = fg to multiplying two integers of length {2b + e)(L — 1) + b, plus pack- 
ing/unpacking overhead of 0((2b + e)L). 

3.2. Reciprocal evaluation points. Let N = |~(2b + e)/2] =6 + [e/2]. We 
evaluate at x = 2 N and 2~ N : 

f(2 N ) = j2fa iN , 

i=0 

2 N { L-l )f{2 - N) = Y^ f . 2{L -X-i)N = 'y-J, , /2 -\ 

2 = 2 = 

Note that there are only [e/2] bits of zero-padding between adjacent coefficients in 
the above sums. We evaluate similarly for g, and then compute the integer products 

h(2 N ) = f(2 N ) ■ g(2 N ), 

2 2N(L-l) h p-N} = 2 N(L-l) f{2 - N) . 2 N ^ L -^ g {2- N ). 

Now we must show how to recover the hi from the two sums 

2L-2 

(2) h(2 N ) = J2 ^ 2i2V ' 

i=0 

(3) 2 2N ^h(2- N ) = 2 J^h 2L ^2 lN - 

i=0 

The hi overlap in both sums, so to retrieve them we must use a similar strategy to 
that described in §2.2| for the polynomial case. There are added complications due 
to the presence of carries, which we resolve as follows. 
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First we write both sums in base 2^. Note that f(2 N ) and g(2 N ) have length 
at most LN, so h(2 N ) has length at most 2LN. Therefore we may write 

2L-1 

(4) h(2 N ) = u ^ 



i=0 

where each digit Ui satisfies < Ui < 2 N . Similarly 2 2N ( L ~ 1%l h{2~ N ) has length at 



most 2NL, so we may write 

2L-1 

(5) 2 2N ^h{2~ N ) = J2 ^L-x-i? 



-,iN 

"J2L-l-i> 
i=0 

where < Wi < 2 N . Decompose each hi into two digits as 
h l = a l + 2 N /3 l , <i <2L-2. 



where 



< oti < 2 N , 

N 



(6) < A < 2 iV - 1. 

The latter inequality is equivalent to saying that hi < 2 N (2 N — 1), which follows 
from ([!]) since 



fH < 2 2b+e 


_ 2 b + e_ 


- 1 + 2 e 




_ 2 b+e 


-2 e (2 b - 1) 


< 2 2b+e 


-2 b+e 




<2 2N _ 


2 N . 





The various quantities we have introduced satisfy the following relations. From 
(U and @ we have ao = uq, and 

(7) A + a l+1 + 5, = u l+1 +2 N 5. l+1 , < i < 2L - 2, 

where 5q = ol-il—x — and where 5i+i G {0, 1} is the carry generated by the 
addition /3, + aj+i + Si. Similarly, from ^ and ([5| we have ct2L-2 = W2L-1, and 

(8) a l + f3 l+1 +e l+1 =w t+1 +2 N e l , -1 < i < 2L - 3, 

where e^l-2 — a -i = and £j G {0,1} is the carry generated by the addition 
a, + + e i+ i. Note that e_i = since we know that 2 2N(L ^h{2^ N ) fits into 
2L digits. 

Given the Ui and lUj, we solve these equations for and /3j (and incidentally 
for Si and £j) by the following iterative procedure. Start with ao = uq and a_i = 
Sq = £-i = 0. Now let < j < 2L — 3, and suppose that otj—i, aj, Sj and £j-i 
have been computed. From Q we have 

Pj+i + sj+i = (w j+ i -atj) + 2 N Sj. 

By ([6]), the left hand side is less than 2 N . Therefore, Ej = 1 if aj > Wj+i, and 
£j = otherwise. Taking (|8| modulo 2 N for i = j — 1, we deduce the value of 

/3j = lUj — ctj-i — £j (mod 2^). 

From ([7]) modulo 2^ we obtain Oj+i, 

Q!j+i = Mj+i — /3j — <5j (mod 2^), 
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and then 6j+i is obtained directly from ([7]). At this stage we have found a?, 
otj+i, arid £j, and so we may repeat the process, until we have determined 
ccoj ■ • ■ j OL2L-2 and /3q, ■ ■ • , P2L-3- Finally we obtain P2L-2 from Q with i — 2L — 3: 

/?2L-2 = 1"2L-2 - "2L-3 - &2L-2 (mod 2^). 

The hi are then reconstructed as hi — cti + 2 N fti. 

Proposition 6. The 'reciprocal' Kronecker substitution reduces the problem of com- 
puting h — fg to two multiplications of integers of length (6 + |~e/2])(L — 1) + b, 
plus packing/unpacking overhead of 0((2b + e)L). 

Example 7. We illustrate (in base ten) using the example from £jl] We have 
N = 4, so the evaluation points are x = 10 4 and x = 1CP 4 . Let 

f{x) = 621s 3 + 887a; 2 + 610a; + 274, 

g(x) = 790a; 3 + 424x 2 + 298a; + 553, 

h(x) = f(x)g(x). 

We have 

/(10 4 ) = 621|0887|0610|0274, 10 12 /(10" 4 ) = 274|0610|0887|0621, 
5 (10 4 ) = 790|0424|0298|0553, 10 12 .g(lCr 4 ) = 553|0298|0424|0790, 

with the vertical bars showing boundaries between base-10 4 digits. The pointwise 
products are 

Mio 4 ) = /(io 4 ) 5 (io 4 ) 

= 49|0686|4138|3154|2917|8508|8997|1522, 
10 24 /i(10- 4 ) = 10 12 /(10- 4 ) • 10 12 .g(l(r 4 ) 

= 15|1563|9060|8575|2943|3142|4083|0590. 

The low digit uo = 1522 of /i(10 4 ) is the bottom half ao of Hq. Comparing 1522 
to wi — 1563, we see that there was no carry, i.e. £o = 0. Thus the top half 
/3o is simply wo = 15, so ho = 151522. We will not follow through the rest of the 
algorithm in detail, but we can see that by subtracting 151522 from the appropriate 
positions in both values, we reveal the top and bottom halves of hi = 418982: 

h(l0 4 ) - 151522 = 49|0686|4138|3154|2917|8508|8982|0000, 

10 24 /i(lCT 4 ) - 10 24 1 5 1 522 = 00|0041|9060|8575|2943|3142|4083|0590. 

3.3. Negated evaluation points. Put N = b + |~e/2] and evaluate at 2 N and 
— 2 N . In f(2 N ), there are |~e/2] bits of zero-padding between adjacent coefficients; 
in f(—2 N ) the padding alternates between zero-padding and 'one-padding'. If one 
only has available a 'packing routine' for non-negative inputs, f(—2 N ) may be 
determined by first computing /(°)(2 2Ar ) and 2 N f (1 \2 2N ) separately, where / (0) 
and f^ are the even and odd parts of /, and then taking their difference. Note 
that f(—2 N ) may be negative, if the leading monomial of / has odd exponent. 

Proposition 8. The 'negated' Kronecker substitution reduces the problem of com- 
puting h = fg to two multiplications of integers of length (b + |~e/2])(L — 1) + b, 
plus packing /unpacking overhead of 0((2b + e)L). 
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Example 9. Wc illustrate with our running example. The evaluation points are 
x = 10 4 and x = -10 4 : 

/(10 4 ) = 887|00000274 + 621 100000610|0000 = 621088706100274, 

/(-lO 4 ) = 887|00000274 - 621 100000610|0000 = -620911306099726, 

5 (10 4 ) = 424|00000553 + 790|00000298|0000 = 790042402980553, 

3 (-10 4 ) = 424|00000553 - 790|00000298|0000 = -789957602979447. 

The pointwise products are 

/i(10 4 ) = /(10 4 ) 5 (10 4 ) = 490686413831542917850889971522, 

h(-l0 4 ) = /(-10 4 ). 9 (-10 4 ) = 490493607029377239842510331522. 

The even and odd coefficients of h are then read off from 

/i (0) (10 8 ) = (h(!0 4 ) + /i(-10 4 ))/2 = 490590|01043046|00788467|00151522, 

10 4 /i (1) (10 8 ) = (h(!0 4 ) - fr(-10 4 ))/2 = 964034|01082839|00418982|0000. 

3.4. Four evaluation points. We take N — \(2b + e) /4] , and evaluate at y = 2 N 



2,2 and —2 . The structure of the algorithm is the same as that of ^2.4 



and we omit the details. The reconstruction algorithm of ^ 3.2 must be used twice: 
first on h^(2 2N ) and 2 2N(L - 1 '>h^ (2~ 2N ) to recover the even-index coefficients ofh, 
and then on h^(2 2N ) and 2 2N ^ L ~ 2 " > h^{2~ 2N ) to recover the odd-index coefficients. 

Proposition 10. The 'four-point' Kronecker substitution reduces the problem of 
computing h = fg to four multiplications of integers of length [(26 + e) /4] (L — 1) + 
b, plus packing /unpacking overhead of 0((2b + e)L). 

Example 11. Continuing with the running example, we put N = 2. For / we have 
/(10 2 ) = 887|0274 + 621|0610|00 = 629931274, 
/(-10 2 ) = 887|0274 - 621|0610|00 = -612190726, 
10 6 /(10 -2 ) = 274|0887|00 + 610|0621 = 280189321, 
10 6 /(-10 -2 ) = 274|0887|00 - 610|0621 = 267988079, 
and for g we have 

5 (10 2 ) = 424|0553 + 790|0298|00 = 794270353, 
5 (-10 2 ) = 424|0553 - 790|0298|00 = -785789247, 
10 6 g(10" 2 ) = 553|0424|00 + 298|0790 = 556023190, 
10 6 5 (-10~ 2 ) = 553|0424|00 - 298|0790 = 550061610. 
The pointwise products are 

^(lO 2 ) = 500335735365719722, 
/i(-10 2 ) = 481052889603923322, 
10 12 /i(10~ 2 ) = 155791760066353990, 
\Q 12 h(-\Q- 2 ) = 147409954195547190. 
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Then we obtain 

h(°)(10 4 ) = {h(l0 2 ) + h(-\Q 2 ))/2 = 49|0694|3124|8482|1522, 

10 12 /i (0) (10" 4 ) = (10 12 /i(l(T 2 ) + 10 12 /i(-10~ 2 ))/2 = 15|1600|8571|3095|0590, 

from which we recover the even-index coefficients of h, using the reconstruction 
algorithm from §3.2| The odd-index coefficients are similarly found from 

10 2 /i (1) (10 4 ) = O(10 2 ) ~ /i(-10 2 ))/2 = 96|4142|2880|8982|00, 

10 10 fr (1) (KT 4 ) = (10 12 h(lQ- 2 ) - 10 12 /i(-10~ 2 ))/2 = 41|9090|2935|4034|00. 



4. Example timings 

The author implemented the algorithms in C for the case of multiplication in 
(Zi/nZi)[x], where n fits into a single machine word. More precisely, the implemen- 
tation first lifts the input polynomials to Z [x] , multiplies in Z [x] using one of the 
algorithms of f|3] and then reduces the result modulo n. The underlying integer 
arithmetic is performed by GMP's low- level 'mpn' routines. The code is freely 
available under the GNU General Public License (GPL) from the author's web site, 
http : //math . harvard . edu/^dmharvey/. 

The timing data shown below were obtained on a 1.8GHz 64-bit AMD Opteron 
machine, kindly supplied by William Stein (funded by NSF grant No. 0555776). 
Both our code and GMP 4.2.1 were compiled using gec 4.1.2, with the -02 optimi- 
sation flag. We also used Pierrick Gaudry's AMD patch for GMP, which improves 
the performance of GMP on the Opteron. 

Figure [T] shows the relative performance of the three new algorithms (*3.2 {3.3 



{ 3.4 1 compared to the standard Kronecker substitution ({ 3.1 1, where n is a random 
4-bit modulus. Figure [2] is the same, but for a 48-bit modulus. We note several 
interesting features of the graphs: 

• On both graphs, the four curves converge towards 1 as the degree grows. 
This reflects the asymptotically quasilinear running time of the underlying 
integer multiplication routine, as discussed in £jl] 

• The most impressive region is in Figure [2j between degrees roughly 100 and 
5000. In this range, the four-point variant is almost twice as fast as the 
standard Kronecker substitution. 

• On both graphs, the negated variant has better performance than the recip- 
rocal variant (although the difference is marginal in the 48-bit case). This 
is due to the added overhead of the complicated reconstruction algorithm 



of S3. 2 



• The new algorithms gain more over the standard Kronecker substitution 
in the 48-bit modulus case than in the 4-bit modulus case. This occurs 
because the packing/unpacking overhead takes up a larger proportion of 
the total time in the 4-bit case. 

• For sufficiently small degree, the new algorithms are inferior to the standard 
Kronecker substitution, due to packing/unpacking overhead. 

For reference, we also compared the performance of our code on the same ma- 
chine to two well-known systems, Magma (version 2.13-5) and NTL (version 5.4.1). 
The latter has specialised routines for arithmetic on polynomials with word-sized 
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coefficients (the zz_pX class). Figure [3] compares our implementation of the stan- 
dard Kronecker substitution against both Magma and NTL for a 4-bit modulus, 
and Figure [4] is for a 48-bit modulus. 
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